Corpus-based Learning for Information Extraction
Abstract
This paper presents an integrated framework for extracting generic information from multidomain texts. It shows that the AFP newswire exhibits regularities that can be processed by a collection of wrappers. It also presents a set of linguistic resources for extracting generic information from texts. Lastly, the paper presents a collection of machine learning techniques for extracting more specific information from a document repository. The result of the analysis is a generic event-based template for each document, accessible through hypertextual and graphical interfaces.
Similar resources
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Yap, Willy and Timothy Baldwin (2009) Experiments on Pattern-based Relation Learning, in Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China
Relation extraction is the task of extracting semantic relations, such as synonymy or hypernymy, between word pairs from corpus data. Past work in relation extraction has concentrated on manually creating templates to use in directly extracting word pairs for a given semantic relation from corpus text. Recently, there has been a move towards using machine learning to automatically learn these pa...
A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
WI&CRF: A Proposed Method for Extracting Required Information from Military Texts
Military information extraction techniques are of interest to military managers and commanders. However, the usual information extraction techniques cannot be used in that domain, because a military corpus has a special structure that differs from that of a non-military corpus. In this paper, the structure of military documents is compared with that of non-military documents. Moreover, a new classification is proposed ...
Learning Patterns for Information Extraction from Free Text
We describe a general approach to the task of information extraction from free text and propose methods for learning syntax patterns automatically from annotated corpora. We study the application of our approach to the extraction of protein-protein interactions from scientific texts. Based on this evaluation, we find that learning patterns outperforms techniques based on handcrafted patterns.
A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning
In this paper we present a methodology for the semantic annotation of domain-specific corpora. This method relies on a domain ontology used initially for identifying and annotating domain-specific instances within the corpus. A machine learning-based information extraction system is then trained on the annotated corpus. The final result of this process is a model which is used to annotate new co...
Publication date: 2007